1. corpnum folder
- This folder builds the "corpnum - kiprisid" relationship from "dataguide.csv" using the Python and Stata scripts listed below.
- The final result is "corpnum_cleaned.dta", which is later used by "all_matches.do" in the "4. cleaning" folder.

1) dataguide.csv
- List of corporate registration numbers provided by DataGuide.

2) matching copr_num to KIPRIS ID.py
- Python script that crawls the KIPRIS ID for each firm's corporate registration number from "https://www.patent.go.kr/jsp/kiponet/mp/apagtinfo/ReadApAgtInfoInput.jsp"
- Rewrite line 17 of the script with your own Chrome driver directory.
- Requires the selenium package.
- The resulting .csv file includes "firmname", "engname", "residnumber" (corporate registration number), and "KIPRIS ID"

3) dataguide_cleaning.do 
- Standardizes string variables of "dataguide.csv" and cleans duplicated observations
 (a) Standardizes addresses of firms at the municipal level
 (b) Standardizes Korean names of firms
- The result is "dg_temp.dta", which is used in "KIPRIS_cleaning.do" below
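
The name standardization and duplicate cleaning above can be sketched in Python as follows. This is only an illustration, not a translation of "dataguide_cleaning.do": the list of legal-form markers, the field names "firmname" and "loc2", and the rule of keeping the first record per key are all assumptions.

```python
import re
import unicodedata

# Hypothetical legal-form markers; the actual list in dataguide_cleaning.do may differ.
LEGAL_FORMS = ("주식회사", "(주)", "유한회사")


def standardize_korean_name(name):
    """Normalize Unicode (NFKC), drop legal-form markers, and remove whitespace."""
    text = unicodedata.normalize("NFKC", name)
    for marker in LEGAL_FORMS:
        text = text.replace(marker, "")
    return re.sub(r"\s+", "", text)


def drop_duplicates(records):
    """Keep the first record for each (standardized name, location) pair."""
    seen = set()
    unique = []
    for rec in records:
        key = (standardize_korean_name(rec["firmname"]), rec["loc2"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Two spellings of the same (made-up) firm in the same municipality collapse into one record, while the same name in a different municipality is kept.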

4) KIPRIS_cleaning.do 
- Standardizes string variables of "assignee.dta" from '1_KIPRIS > Biblio' and matches corporate numbers to the kiprisid of assignees appearing in "assignee.dta"
 (a) Standardizes addresses of assignees at the municipal level
 (b) Standardizes Korean names of assignees
 (c) Creates "assg_type", which shows the type of the assignee (firm, public entity, or individual)
 (d) Matches DataGuide symbols to the kiprisid of assignees via corporate number
- The results are "KIPRIS_cleaned.dta" and "corpnum_matched.dta", which are used in "corpnum_cleaned.do" below
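
Step (d) is essentially a join on the corporate registration number. A minimal Python sketch, assuming hypothetical field names "corpnum" and "symbol" on the DataGuide side and the "residnumber" and "KIPRIS ID" columns of the crawled .csv; stripping non-digit characters before comparing is also an assumption, not something taken from "KIPRIS_cleaning.do".

```python
import re


def normalize_corpnum(value):
    """Keep digits only, so hyphenated and plain numbers compare equal."""
    return re.sub(r"\D", "", str(value))


def match_by_corpnum(dg_rows, crawled_rows):
    """Attach a DataGuide symbol to each crawled KIPRIS ID sharing a corporate number."""
    # Lookup table: normalized corporate number -> DataGuide symbol (hypothetical fields).
    symbol_by_corpnum = {
        normalize_corpnum(row["corpnum"]): row["symbol"] for row in dg_rows
    }
    matched = []
    for row in crawled_rows:
        symbol = symbol_by_corpnum.get(normalize_corpnum(row["residnumber"]))
        if symbol is not None:
            matched.append({"kiprisid": row["KIPRIS ID"], "symbol": symbol})
    return matched
```

Rows whose corporate number does not appear in the DataGuide list are simply left out of the result, which corresponds to the unmatched firms handled later by string matching.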

5) corpnum_cleaned.do
- Cleans duplicated matches of "corpnum - dataguide symbol" based on "loc2" and symbol type (A or B)
- The result is "corpnum_cleaned.dta", which is later used for making the final matching table in the "4. cleaning" folder.

==================================================
2. family folder 
- Matches USPTO KR assignee ids to DataGuide symbols
- The final result is "family_match.csv", which should be converted to a '.dta' file and used in the "4. cleaning" folder

1) family_makebasic.do 
- makes input files for "family.py" 
 (a) USPTO_KR_assg.dta: list of temporary assignee ids & names, created from the result of name standardization of USPTO KR assignees (uspto_namstand)
 (b) assgid.csv: list of "patent application number (wku), assignee id (assgid), standard name (standard_name), stem name (stem_name)", which is used to find matches based on family information
 (c) family_temp.csv: granted US "DOCDB" family patents collected from "family.dta"
 (d) assg_weight.csv: weight of each assignee for each patent

2) family.py
- Finds the DataGuide symbol for each kiprisid based on (b), (c), (d) above.
- The result is "family_match.csv", which should be converted to a '.dta' file for further cleaning in "4. cleaning"
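
The idea of weighting assignees and matching through family information can be sketched as below. This is not the logic of "family.py": the equal-share weighting rule, the `family_symbols` mapping (patent number to DataGuide symbols found among its family members), and the "keep the highest-scoring symbol" rule are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def assignee_weights(patent_assignees):
    """Give each assignee of a patent an equal share (assumed rule)."""
    weights = {}
    for wku, assgids in patent_assignees.items():
        if not assgids:
            continue  # a patent without assignees contributes nothing
        share = 1.0 / len(assgids)
        for assgid in assgids:
            weights[(wku, assgid)] = share
    return weights


def vote_symbols(weights, family_symbols):
    """Sum weights per (assgid, symbol) and keep the highest-scoring symbol."""
    scores = defaultdict(Counter)
    for (wku, assgid), weight in weights.items():
        # family_symbols is a hypothetical mapping: wku -> symbols seen in the family.
        for symbol in family_symbols.get(wku, ()):
            scores[assgid][symbol] += weight
    return {assgid: counts.most_common(1)[0][0] for assgid, counts in scores.items()}
```

A patent held by a single assignee counts fully toward that assignee's candidate symbol, while a co-owned patent is split among its owners, so sole ownership dominates the vote.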
==================================================
3. stringmatching folder 
- The Python scripts in this folder find matches of 'Kiprisid - DataGuide Symbol' based on the string matching algorithm of NBER PDP and its variation.
- The final results are 8 '.csv' files, which should be converted into '.dta' files and will be used in the "4. cleaning" folder:
 (a) 'matched_perfect_dg2k.csv'
 (b) 'matched_scorebased_dg2k.csv'
 (c) 'matched_perfect_k2dg.csv'
 (d) 'matched_scorebased_k2dg.csv'
 (e) 'matched_uspto_perfect_d2k.csv'
 (f) 'matched_uspto_scorebased_d2k.csv'
 (g) 'matched_uspto_perfect_k2d.csv'
 (h) 'matched_uspto_scorebased_k2d.csv'

(1) kipris folder > Matching_dg2k.py & Matching_k2dg.py 
- Inputs: list of assignees and firms with 'standardized English names (standard_name, stem_name)', 'kiprisid', 'DataGuide symbol', and 'location information (loc1, loc2)'
 (a) Standardized English names of assignees with the kiprisid and location info. can be obtained by running "0. namestd > nameonly_main.do" where the input file is "KIPRIS_cleaned.dta" from "1. corpnum folder > KIPRIS_cleaning.do" 
 (b) Standardized English names of firms with the DataGuide Symbol  and location info. can be obtained by running "0. namestd > nameonly_main.do" where the input file is "dg_corpnum_unmatched.dta" from "1. corpnum folder > KIPRIS_cleaning.do" 
- Result: '.csv' file which gives "kiprisid - DataGuide symbol" based on the names of assignees and firms 
 (a) dg2k: calculates the score based on the set of names that appear in the DataGuide sample
 (b) k2dg: calculates the score based on the set of names that appear in the KIPO sample
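
Why the two directions give different scores can be illustrated with a frequency-weighted token score. The sketch below is not the NBER PDP routine used in Matching_dg2k.py & Matching_k2dg.py; an inverse-frequency weighted Jaccard similarity is swapped in for illustration, and all function names are hypothetical. The only point carried over from the scripts is that token weights come from one reference set of names (DataGuide for dg2k, KIPO for k2dg).

```python
import math
from collections import Counter


def token_weights(reference_names):
    """Return (weights, unseen_weight) computed from one reference set of names."""
    n = len(reference_names)
    # Number of reference names in which each token appears.
    doc_freq = Counter(tok for name in reference_names for tok in set(name.split()))
    weights = {tok: math.log(1 + n / df) for tok, df in doc_freq.items()}
    # Tokens absent from the reference set are treated like tokens seen only once.
    return weights, math.log(1 + n)


def _weight_sum(tokens, weights, unseen_weight):
    return sum(weights.get(tok, unseen_weight) for tok in tokens)


def match_score(name_a, name_b, weights, unseen_weight):
    """Weighted Jaccard similarity of the two names' token sets."""
    a, b = set(name_a.split()), set(name_b.split())
    union = _weight_sum(a | b, weights, unseen_weight)
    if union == 0:
        return 0.0
    return _weight_sum(a & b, weights, unseen_weight) / union
```

A token that is common in the reference set (for example a group name shared by many affiliates) gets a low weight, so sharing it contributes less to the score. Since the DataGuide and KIPO name sets have different token frequencies, the same pair of names can score differently under dg2k and k2dg.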

(2) uspto folder > Matching_uspto.py 
- Inputs: list of assignees and firms with 'standardized English names (standard_name, stem_name)', 'kiprisid', and 'DataGuide symbol'
 (a) Standardized English names of assignees with the kiprisid can be obtained by running "0. namestd > nameonly_main.do" where the input file is "KIPRIS_cleaned.dta" from "1. corpnum folder > KIPRIS_cleaning.do" 
 (b) Standardized English names of firms with the DataGuide Symbol can be obtained by running "0. namestd > nameonly_main.do" where the input file is "dg_temp.dta" from "1. corpnum folder > dataguide_cleaning.do"
- Result: '.csv' file which gives "uspto assignee id - DataGuide symbol" based on the names of assignees and firms 
- Change INPT & TRGT to get the matching results in both directions ('uspto assignees to dataguide firms' & 'dataguide firms to uspto assignees'), just as in Matching_dg2k.py & Matching_k2dg.py
==================================================
4. cleaning folder

(1) uspto_matches.do 
- Input files: 'matched_uspto_perfect_d2k.csv', 'matched_uspto_scorebased_d2k.csv', 'matched_uspto_perfect_k2d.csv', 'matched_uspto_scorebased_k2d.csv'
- Cleans multiple matches of 'assignee id - DataGuide symbol' (i.e. multiple DG symbols matched to a single assignee id or the other way around) 
- Harmonizes assignee ids if necessary (assgidH_dict.dta); the 'assgid' variable in "assignee_uspto.dta" needs to be updated accordingly at the end
- Incorporates the results of matches via family information from "family_match.csv" of the "2. family" folder
- This code creates 'assgidH_dict.dta', the matching table between the temporary assignee id ('assgid') and the final assignee id ('assgidH')
- Result: uspto_matches.dta 
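
Finding the multiple matches that need cleaning can be sketched as follows. The function name and the pair-list input are hypothetical; how "uspto_matches.do" actually resolves each ambiguous case is not shown here.

```python
from collections import defaultdict


def ambiguous_matches(pairs):
    """Return ids matched to several symbols and symbols matched to several ids."""
    by_id = defaultdict(set)
    by_symbol = defaultdict(set)
    for assgid, symbol in pairs:
        by_id[assgid].add(symbol)
        by_symbol[symbol].add(assgid)
    many_symbols = {k: v for k, v in by_id.items() if len(v) > 1}
    many_ids = {k: v for k, v in by_symbol.items() if len(v) > 1}
    return many_symbols, many_ids
```

Pairs that appear in neither returned dictionary are one-to-one and need no further cleaning.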

(2) all_matches.do 
- Input files: 'matched_perfect_dg2k.csv', 'matched_scorebased_dg2k.csv', 'matched_perfect_k2dg.csv', 'matched_scorebased_k2dg.csv'
- Cleans multiple matches of 'Kiprisid - DataGuide symbol' (i.e. multiple DG symbols matched to a single Kiprisid or the other way around) 
- Harmonizes Kiprisid if necessary (kip_harm.dta, kip_harm2.dta); the 'kiprisid' variable in "assignee.dta" needs to be updated accordingly at the end
- Incorporates the results of 'USPTO assignee id - DataGuide symbol' matches from above and 'Kiprisid - DataGuide symbol' matches via corporate number from "corpnum_cleaned.dta" of the "1. corpnum" folder
- This code creates 'kip_harm.dta' and 'kip_harm2.dta', the matching tables between the temporary assignee id ('kiprisid') and the final assignee id ('kiprisidH')
- Result: final_matches_updated2.dta 
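
Updating the 'kiprisid' variable with a harmonization table amounts to a dictionary lookup that leaves unlisted ids unchanged. A minimal sketch, assuming the table has been loaded as a plain mapping from temporary id to final id; the function name and the in-memory row format are hypothetical.

```python
def harmonize_ids(rows, harm_table, id_field="kiprisid"):
    """Replace each temporary id with its final id; ids not in the table are kept."""
    return [
        {**row, id_field: harm_table.get(row[id_field], row[id_field])}
        for row in rows
    ]
```

The same helper would apply to the USPTO side by passing `id_field="assgid"` together with a mapping built from 'assgidH_dict.dta'.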
